We introduce a semi-supervised learning estimator which tends to the first kernel principal component
as the number of labeled points vanishes. Our approach is based on the notion of optimal target vector,
which is defined as follows. Given an input data set of x values, the optimal target vector y is the one
for which, treating it as the target and using kernel ridge regression to model the dependency of y on x,
the training error achieves its minimum value. For an unlabeled data set, the first kernel principal
component is the optimal target vector. When one is given a partially labeled data set, one may still
look for the optimal target vector minimizing the training error. We use this new estimator in two
directions: first, as a substitute for kernel principal component analysis when some labeled data are
available, to perform dimensionality reduction; second, to develop a semi-supervised regression and
classification algorithm for transductive inference. We show applications of the proposed method in
both directions.

I. INTRODUCTION 

The problem of effectively combining unlabeled data with labeled data, semi-supervised learning, is of central
importance in machine learning; see, for example, [?, ?] and references therein. Semi-supervised learning methods
usually assume that adjacent points and/or points in the same structure (group, cluster) should have similar labels;
one may assume that data are situated on a low-dimensional manifold which can be approximated by a weighted discrete
graph whose vertices are identified with the empirical (labeled and unlabeled) data points. This can be seen as a
form of regularization [?]. A common feature of these methods, see also [?], is that, as the number of labeled points
vanishes, the solution tends to the constant vector. An interesting survey of the semi-supervised learning literature
may be found on the web [?]. Improving regression with unlabeled data is the problem considered in [?], where
co-training is achieved using k-NN regressors. A statistical physics approach, based on the Potts model, is described
in [?]. An issue closely related to semi-supervised learning is active learning: some attempts to combine active
learning and semi-supervised learning have been made [?].

The purpose of this work is to introduce a semi-supervised learning estimator which, as the number of labeled points
vanishes, tends to the first kernel principal component [10]; when a suitable number of labeled points is available,
it may be used for transductive inference [?]. Our approach is based on the following fact. Given an unlabeled data
set, its first kernel principal component is such that, treating it as the target vector, supervised kernel ridge
regression provides the minimum training error. Now, suppose that one is given a partially labeled data set: one may
still look for the target vector minimizing the training error. This optimal target vector may be seen as the
generalization of the first kernel principal component to the semi-supervised case.

The paper is organized as follows. Section II describes our approach and Section III the experiments we performed.
Some conclusions are drawn in Section IV.

II. METHODS 

A. Kernel ridge regression 

We briefly recall the properties of kernel ridge regression (KRR), while referring the reader to [?] for further
technical details. Let us consider a set of $\ell$ independent, identically distributed data
$S = \{(\mathbf{x}_i, y_i)\}_{i=1}^{\ell}$, where $\mathbf{x}_i$ is the $n$-dimensional vector of input variables and
$y_i$ is the scalar output variable. Data are drawn from an unknown probability distribution; we assume that both
$\mathbf{x}$ and $y$ have been centered, i.e. they have been linearly transformed to have zero mean. The regularized
linear predictor is $y = \mathbf{w} \cdot \mathbf{x}$, where $\mathbf{w}$ minimizes the following functional:

$$\sum_{i=1}^{\ell} \left( y_i - \mathbf{w} \cdot \mathbf{x}_i \right)^2 + \lambda \|\mathbf{w}\|^2 . \qquad (1)$$

Here $\|\mathbf{w}\| = \sqrt{\mathbf{w} \cdot \mathbf{w}}$ and $\lambda > 0$ is the regularization parameter. For
$\lambda = 0$, predictor (1) is invariant when new variables, statistically independent of the input and target
variables, are added to the set of input variables (IIV property, [?]). One may show that this invariance property
holds, for (1), also at finite $\lambda > 0$.

KRR is the kernel version of the previous predictor. Calling $\mathbf{y} = (y_1, y_2, \ldots, y_\ell)^T$ the vector
formed by the $\ell$ values of the output variable and $K(\cdot, \cdot)$ being a positive definite symmetric function,
the predictor has the following form:



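$$y = f(\mathbf{x}) = \sum_{i=1}^{\ell} c_i K(\mathbf{x}_i, \mathbf{x}), \qquad (2)$$

where the coefficients $\{c_i\}$ are given by

$$\mathbf{c} = (K + \lambda I)^{-1} \mathbf{y}, \qquad (3)$$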

$K$ being the $\ell \times \ell$ matrix with elements $K(\mathbf{x}_i, \mathbf{x}_j)$. Equation (2) may be seen to
correspond to a linear predictor in the feature space
$\Phi(\mathbf{x}) = (\sqrt{\alpha_1}\,\psi_1(\mathbf{x}), \sqrt{\alpha_2}\,\psi_2(\mathbf{x}), \ldots, \sqrt{\alpha_N}\,\psi_N(\mathbf{x}), \ldots)$,
where $\alpha_i$ and $\psi_i$ are the eigenvalues and eigenfunctions of the integral operator with kernel $K$. One may
show [?] that, for KRR predictors with nonlinear kernels, the IIV property does not generically hold, even for those
kernels, discussed in [13], for which the property holds at $\lambda = 0$. Regularization breaks the IIV invariance
in those cases.

Due to (2) and (3), the predicted output vector $\tilde{\mathbf{y}}$, in correspondence of the true target vector
$\mathbf{y}$, is given by $\tilde{\mathbf{y}} = G \mathbf{y}$, where the symmetric matrix $G$ is given by

$$G = K (K + \lambda I)^{-1} . \qquad (4)$$
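
Equations (3) and (4) can be illustrated with a minimal NumPy sketch. The data are synthetic, and the Gaussian kernel convention $\exp(-\|\mathbf{x}-\mathbf{z}\|^2 / 2\sigma^2)$ as well as the helper names `gaussian_kernel`, `krr_fit` and `smoother_matrix` are assumptions made for this illustration, not part of the original method description.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix between rows of X and rows of Z (assumed convention)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_fit(K, y, lam):
    """Coefficients c = (K + lam I)^{-1} y, as in equation (3)."""
    return np.linalg.solve(K + lam * np.eye(K.shape[0]), y)

def smoother_matrix(K, lam):
    """G = K (K + lam I)^{-1}, as in equation (4)."""
    return K @ np.linalg.inv(K + lam * np.eye(K.shape[0]))

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 3))       # synthetic inputs, for illustration only
y = rng.normal(size=30)
y = y - y.mean()                   # centered targets

lam = 1.0
K = gaussian_kernel(X, X)
c = krr_fit(K, y, lam)
G = smoother_matrix(K, lam)
y_fit = K @ c                      # predictions at the training points via (2)
```

Under these assumptions `y_fit` coincides with `G @ y`, and `G` is symmetric because $K$ commutes with $(K + \lambda I)^{-1}$.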

Note that matrix $G$ depends only on the distribution of $\{\mathbf{x}\}$ values: $G$ embodies information about the
structures present in the $\{\mathbf{x}\}$ data set. Indeed, for $i \neq j$, the matrix element $G_{ij}$ quantifies
how much the target value of the $j$-th point influences the estimate of the target of point $i$. Let us now consider
the leave-one-out scheme; let data point $i$ be removed from the data set and the model be trained using the remaining
$\ell - 1$ points. We denote by $\hat{y}_i$ the target value thus predicted, in correspondence of $\mathbf{x}_i$. It
is well known [12] that the leave-one-out error $\hat{y}_i - y_i$ and the training error obtained using the whole
data set, $\tilde{y}_i - y_i$, satisfy:

$$\hat{y}_i - y_i = \frac{\tilde{y}_i - y_i}{1 - G_{ii}} . \qquad (5)$$
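
Identity (5) can be checked numerically by comparing an explicit leave-one-out loop with the shortcut based on the diagonal of $G$. The sketch below uses synthetic data and an uncentered linear kernel; both are assumptions of the illustration (the kernel is kept fixed when a point is removed, i.e. no re-centering is performed).

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 25, 0.5
X = rng.normal(size=(n, 2))            # synthetic inputs
y = rng.normal(size=n)                 # synthetic targets
K = X @ X.T                            # linear kernel, kept fixed (no re-centering)

G = K @ np.linalg.inv(K + lam * np.eye(n))
train_resid = G @ y - y                # training errors, tilde{y}_i - y_i

# Explicit leave-one-out: retrain without point i, then predict at x_i.
loo_resid = np.empty(n)
for i in range(n):
    keep = np.arange(n) != i
    c = np.linalg.solve(K[np.ix_(keep, keep)] + lam * np.eye(n - 1), y[keep])
    loo_resid[i] = K[i, keep] @ c - y[i]

# Shortcut given by equation (5): no retraining needed.
shortcut = train_resid / (1.0 - np.diag(G))
```

The two arrays `loo_resid` and `shortcut` agree up to round-off, so the $\ell$ retrainings can be replaced by a single computation of $G$.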

Equation (5) shows that the closer $G_{ii}$ is to one, the farther the leave-one-out predicted value is from the one
obtained using also point $i$ in the training stage. Consider a point $i$ in a dense region of the feature space: one
may expect that removing this point from the data set would not change the estimate much, since it can be well
predicted on the basis of the values of neighboring points. Therefore points in low-density regions of the feature
space are characterized by diagonal values $G_{ii}$ close to one, while $G_{ii}$ is close to zero for points
$\mathbf{x}_i$ in dense regions: the diagonal elements of $G$ thus convey information about the structure of points in
the feature space. It is worth stressing that, given a kernel function, the corresponding features
$\psi_\gamma(\mathbf{x})$ are not centered in general. One can show [?] that centering the features
($\psi_\gamma \to \psi_\gamma - \langle \psi_\gamma \rangle$, for all $\gamma$) amounts to performing the following
transformation on the kernel matrix:

$$K \to \tilde{K} = K - I_\ell K - K I_\ell + I_\ell K I_\ell ,$$

where $(I_\ell)_{ij} = 1/\ell$, and to working with the centered kernel $\tilde{K}$. In the following we will assume
that the kernel matrix $K$ has been centered.
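
As a small sanity check of the centering transformation, the following sketch applies it to a linear kernel built from synthetic, deliberately uncentered data; for the linear kernel the result should equal the Gram matrix of the mean-subtracted inputs. The function name `center_kernel` is chosen here for illustration.

```python
import numpy as np

def center_kernel(K):
    """Return K - I_l K - K I_l + I_l K I_l, where I_l has all entries 1/l."""
    n = K.shape[0]
    ones = np.full((n, n), 1.0 / n)
    return K - ones @ K - K @ ones + ones @ K @ ones

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, size=(20, 4))   # synthetic inputs with nonzero mean
K = X @ X.T                             # linear kernel on the raw inputs
K_c = center_kernel(K)

Xc = X - X.mean(axis=0)                 # explicit centering of the features
K_explicit = Xc @ Xc.T
```

For nonlinear kernels the features are not available explicitly, which is why the transformation is applied directly to the kernel matrix.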



B. Optimal target vector 

The training error of the KRR model is proportional to
$(\mathbf{y} - G\mathbf{y})^T (\mathbf{y} - G\mathbf{y}) = \mathbf{y}^T H \mathbf{y}$, where $H = I - 2G + GG$
is a symmetric and positive matrix. In the unsupervised case the data set is made of $\mathbf{x}$ points,
$\{\mathbf{x}_i\}_{i=1}^{\ell}$; the target function $y$ is missing. However we may pose the following question: what
is the vector $\mathbf{y} \in \mathbb{R}^{\ell}$ such that treating it as the target vector leads to the best fit, i.e.
the minimum training error $\mathbf{y}^T H \mathbf{y}$? We expect that this optimal target vector would bring
information about the structures present in the data. To avoid the trivial solution $\mathbf{y} = 0$, we constrain the
target vector to have unit norm, $\mathbf{y}^T \mathbf{y} = 1$; it follows that the optimal vector is the normalized
eigenvector of $H$ with the smallest eigenvalue. On the other hand, matrix $H$ is a function of matrix $K$: hence it
has the same eigenvectors as $K$. Since $H = (I - G)^2$ and $G$ has eigenvalues $\mu_K / (\mu_K + \lambda)$, the
corresponding eigenvalues $\mu_H$ and $\mu_K$ are related by the following monotonically decreasing correspondence:

$$\mu_H = \left( \frac{\lambda}{\mu_K + \lambda} \right)^2 .$$
Therefore, independently of $\lambda$, the smallest eigenvalue of $H$ corresponds to the largest eigenvalue of $K$, and
the optimal vector coincides with the first kernel principal component. To conclude this subsection, we have shown
that the method in [10] may also be motivated as the search for the optimal target vector.
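
The equivalence stated above can be verified on a toy example. The sketch below (synthetic anisotropic data, centered linear kernel, $\lambda = 1$; all of these are assumptions of the illustration) computes the unit-norm minimizer of $\mathbf{y}^T H \mathbf{y}$ and compares it with the leading eigenvector of $K$, and also checks the eigenvalue correspondence.

```python
import numpy as np

rng = np.random.default_rng(3)
n, lam = 40, 1.0
X = rng.normal(size=(n, 3)) * np.array([3.0, 1.0, 0.3])   # synthetic, anisotropic
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T                          # centered linear kernel

G = K @ np.linalg.inv(K + lam * np.eye(n))
H = np.eye(n) - 2.0 * G + G @ G
H = 0.5 * (H + H.T)                    # remove round-off asymmetry

mu_H, V_H = np.linalg.eigh(H)          # eigenvalues in ascending order
mu_K, V_K = np.linalg.eigh(K)

y_opt = V_H[:, 0]                      # unit-norm minimizer of y^T H y
first_kpc = V_K[:, -1]                 # eigenvector of the largest eigenvalue of K

overlap = abs(y_opt @ first_kpc)       # equals 1 up to round-off (sign is arbitrary)
predicted_mu_H = np.sort((lam / (mu_K + lam)) ** 2)
```

The overlap is taken in absolute value because eigenvectors are defined only up to a sign.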

The notion of optimal target vector was introduced in [15], where a kernel method for dichotomic clustering
was proposed, consisting in finding the ground state of a class of Ising models.

C. Semi-supervised learning 

Now we consider the case in which we are given a set $S = \{\mathbf{x}_i\}_{i=1}^{\ell}$ of data points with unknown
targets $\{t_i\}_{i=1}^{\ell}$, and a set $S' = \{(\mathbf{x}_j, u_j)\}_{j=\ell+1}^{N}$, where $N = \ell + m$, of
input-output data. Without loss of generality we assume that the labeled points belong to two classes, and take
$u_j \in \{-1/\sqrt{N}, +1/\sqrt{N}\}$ for all $j$. The $N$-dimensional full vector of targets $\mathbf{y}$ is
obtained by appending the $\{t\}$ (unknown) and $\{u\}$ (known) values:

$$\mathbf{y} = (\mathbf{t}^T \; \mathbf{u}^T)^T .$$

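Keeping the kernel and $\lambda$ fixed, we look for the unit-norm target vector $\mathbf{y}$ minimizing the training
error $\mathbf{y}^T H \mathbf{y}$. The $N \times N$ matrix $H$ has the block structure

$$H = \begin{pmatrix} H_0 & H_1 \\ H_1^T & H_2 \end{pmatrix},$$

where $H_0$ is an $\ell \times \ell$ matrix. Neglecting a constant term, the optimal vector is determined by the vector
$\mathbf{t}$ minimizing

$$\mathcal{E}(\mathbf{t}) = \mathbf{t}^T H_0 \mathbf{t} + 2\, \mathbf{t}^T H_1 \mathbf{u} \qquad (6)$$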

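under the constraint $\|\mathbf{t}\|^2 = 1 - \|\mathbf{u}\|^2$. The first term of $\mathcal{E}$ favors projections of
the $\ell$ points with great variance, whereas the second term measures their consistency with the labeled points. Let
us denote by $\{\Psi_\alpha\}$ and $\{\mu_\alpha\}$ the eigenvectors and eigenvalues of $H_0$, sorted by increasing
$\mu_\alpha$. We express $\mathbf{t} = \sum_{\alpha=1}^{\ell} \xi_\alpha \Psi_\alpha$. The coefficients $\xi_\alpha$
for the minimum are given by

$$\xi_\alpha = \frac{f_\alpha}{\mu - \mu_\alpha} ,$$

where $f_\alpha = \Psi_\alpha^T H_1 \mathbf{u}$, and $\mu$ is a Lagrange multiplier which must be tuned to satisfy:

$$\sum_{\alpha=1}^{\ell} \left( \frac{f_\alpha}{\mu - \mu_\alpha} \right)^2 = 1 - \|\mathbf{u}\|^2 . \qquad (7)$$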

Equation (7) always has at least one solution with $\mu < \mu_1$, see figure 1, and usually this is the one minimizing
$\mathcal{E}$. However all the solutions of (7) must be compared according to their energies $\mathcal{E}$; the one
corresponding to the lowest $\mathcal{E}$, $\mathbf{y}^*$, is then selected. Clearly as $m \to 0$ one recovers the
first eigenvector of $H_0$, i.e. the first kernel principal component: $\mathbf{y}^*$ thus constitutes a
generalization of the latter to the semi-supervised case. To construct the other generalized kernel principal
components, we make the following transformation on matrix $H$:

$$\tilde{H} = H - P^* H - H P^* + P^* H P^* ,$$

where $P^* = \mathbf{y}^* \mathbf{y}^{*T}$ is the projector on the linear subspace spanned by $\mathbf{y}^*$. The
symmetric matrix $\tilde{H}$ has its lowest eigenvalue equal to zero, corresponding to the eigenvector $\mathbf{y}^*$.
The system of eigenvectors of $\tilde{H}$ constitutes a generalization of kernel principal components to the
semi-supervised case.
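
The procedure of this subsection can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the function names `optimal_target` and `energy` are hypothetical, and only the root of the constraint equation with $\mu < \mu_1$ is computed (by bisection); the search for further roots in the intervals between consecutive $\mu_\alpha$ and the comparison of their energies, required by the text, is omitted for brevity. The sketch also assumes $f_1 \neq 0$.

```python
import numpy as np

def optimal_target(H, u, n_bisect=100):
    """Solve the constraint equation for mu < mu_1 and return (y_star, mu).

    H: N x N matrix I - 2G + GG, unlabeled points first.
    u: the m known targets, with ||u|| < 1.
    """
    ell = H.shape[0] - u.shape[0]
    H0, H1 = H[:ell, :ell], H[:ell, ell:]
    mu_a, Psi = np.linalg.eigh(H0)                    # ascending eigenvalues of H_0
    f = Psi.T @ (H1 @ u)                              # f_alpha = Psi_alpha^T H_1 u
    r2 = 1.0 - u @ u                                  # required squared norm of t

    def q(mu):                                        # left-hand side of the constraint
        return np.sum((f / (mu - mu_a)) ** 2)

    # On (-inf, mu_1) q grows monotonically from 0 to +inf, so one root lies there.
    lo = mu_a[0] - np.linalg.norm(f) / np.sqrt(r2)    # guarantees q(lo) <= r2
    hi = mu_a[0]
    for _ in range(n_bisect):
        mid = 0.5 * (lo + hi)
        if q(mid) < r2:
            lo = mid
        else:
            hi = mid
    mu = 0.5 * (lo + hi)
    t = Psi @ (f / (mu - mu_a))                       # t = sum_alpha xi_alpha Psi_alpha
    return np.concatenate([t, u]), mu

def energy(t, H0, H1, u):
    """E(t) = t^T H_0 t + 2 t^T H_1 u."""
    return t @ H0 @ t + 2.0 * t @ (H1 @ u)

rng = np.random.default_rng(4)
N, m, lam = 30, 4, 1.0
labels = np.where(np.arange(N) % 2 == 0, 1.0, -1.0)  # synthetic two-class labels
X = rng.normal(size=(N, 2)) + 2.0 * labels[:, None]  # two synthetic clusters
Xc = X - X.mean(axis=0)
K = Xc @ Xc.T                                        # centered linear kernel
G = K @ np.linalg.inv(K + lam * np.eye(N))
H = np.eye(N) - 2.0 * G + G @ G
H = 0.5 * (H + H.T)

u = labels[N - m:] / np.sqrt(N)                      # last m points act as labeled data
y_star, mu = optimal_target(H, u)

# Deflation: y_star becomes an eigenvector of H_tilde with eigenvalue zero.
P = np.outer(y_star, y_star)
H_tilde = H - P @ H - H @ P + P @ H @ P
```

In a two-class setting the unlabeled points would then be assigned according to the sign of the first $\ell$ components of `y_star`.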

III. EXPERIMENTS 

A. Generalizing kernel principal components 

Now we present some simulations of the proposed method, focusing on the dimensionality reduction issue and
comparing with fully unsupervised kernel principal component analysis. We consider three well-known data sets:
IRIS (100 points in a four-dimensional space, second and third classes, versicolor and virginica); the colon cancer
data set of [16], consisting of 40 tumor and 22 normal colon tissue samples, each sample being described by the 100
most discriminant genes; the leukemia data set of [?], consisting of bone marrow tissue samples, 47 affected
by acute myeloid leukemia (AML) and 25 by acute lymphoblastic leukemia (ALL), each sample being described by
the 500 most discriminant genes. The following question is addressed: is $\mathbf{y}^*$ more correlated with the true
labels than the fully unsupervised first kernel principal component? Here we restrict our analysis to the linear
kernel.

TABLE I: The mean squared error on the Boston data set obtained using the optimal target (OT) approach and the
classical kernel ridge regression (KRR) method. The size of the test set is $\ell$.

$\ell$    OT        KRR
5         2.3790    3.6312
10        2.7938    4.0111
15        2.9460    4.1057
20        3.1024    4.1802
25        3.1569    4.1653

We start with IRIS and proceed as follows. We randomly select $m = 4$ points and, treating them as labeled, we
find the system of eigenvectors of $\tilde{H}$. Then we evaluate the linear correlation $R$ between the eigenvectors
and the true labels of the whole data set. The distributions of $R$ for the four eigenvectors are depicted in figure
2. We observe that in most cases the vector $\mathbf{y}^*$ is more correlated with the true classes than the fully
unsupervised principal component: the one-dimensional projection of data onto $\mathbf{y}^*$ is more informative than
the first principal component. However there are situations where the use of labeled points leads to poor results; a
typical example is depicted in figure 3. Figure 4 depicts a situation where knowledge of the labeled points leads to a
relevant improvement.

In general, we denote by $f$ the fraction of instances such that $\mathbf{y}^*$ is more correlated with the true
labels than the first principal component. In figure 5 we depict $f$ as a function of the labeled fraction $m/N$ for
the three data sets considered here. At $m/N = 0.16$, $f$ is already nearly one. The semi-supervised method proposed
here outperforms principal components almost always for large $m$.

B. Transductive inference 

In this subsection we demonstrate the effectiveness of the proposed approach for estimating the values of a function 
at a set of test points, given a set of input-output data points, without estimating (as an intermediate step) the 
regression function. 

The Boston data set is a well-known problem where one is required to estimate house prices according to various
statistics based on 13 locational, economic and structural features from data collected by the U.S. Census Service in
the Boston, Massachusetts area. For $\ell = 5, 10, 15, 20, 25$, we partition the data set of $N = 506$ observations
randomly 100 times into a training set of $N - \ell$ observations and a testing set of $\ell$ observations. We use a
Gaussian kernel with $\sigma = 1$ and set $\lambda = 1$; results are stable against variations of these parameters. In
Table I we report, for each value of $\ell$, the mean squared error (MSE) on the test set, averaged over the 100 runs,
that we obtain using the optimal target vector $\mathbf{y}^*$. In Table I we also report the MSE obtained using
classical KRR in the two-step procedure: (i) estimation of the regression function using the training data set;
(ii) calculation of the regression function at the points of interest (test data set). The optimal target approach
gives a lower MSE than classical KRR for every value of $\ell$ (for example, 2.3790 versus 3.6312 at $\ell = 5$).
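
For reference, the classical two-step KRR baseline, steps (i) and (ii) above, may be sketched as follows. The data are a synthetic one-dimensional stand-in (not the Boston data), the Gaussian kernel convention $\exp(-\|\mathbf{x}-\mathbf{z}\|^2 / 2\sigma^2)$ is an assumption, only the targets (not the kernel features) are centered, and the function name `krr_two_step` is hypothetical.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian kernel matrix between rows of X and rows of Z (assumed convention)."""
    sq = np.sum(X**2, axis=1)[:, None] + np.sum(Z**2, axis=1)[None, :] - 2.0 * X @ Z.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma**2))

def krr_two_step(X_train, y_train, X_test, sigma=1.0, lam=1.0):
    """(i) fit c = (K + lam I)^{-1} y on the training set; (ii) evaluate f at the test points."""
    y_mean = y_train.mean()
    K = gaussian_kernel(X_train, X_train, sigma)
    c = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train - y_mean)   # step (i)
    return gaussian_kernel(X_test, X_train, sigma) @ c + y_mean            # step (ii)

rng = np.random.default_rng(6)
X = rng.uniform(-2.0, 2.0, size=(120, 1))             # synthetic stand-in data
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=120)

test_idx = rng.choice(120, size=10, replace=False)    # random test set of size 10
train_idx = np.setdiff1d(np.arange(120), test_idx)

pred = krr_two_step(X[train_idx], y[train_idx], X[test_idx])
mse = np.mean((pred - y[test_idx]) ** 2)
```

Repeating the random split and averaging `mse` over the runs would mirror the evaluation protocol described above for the KRR column of Table I.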

We also consider five well-known pattern recognition data sets from the UCI database: we evaluate the optimal target
vector, and points are then attributed to classes according to the sign of $\mathbf{y}^*$. We compare with the
transductive linear discrimination (TLD) approach developed in [?]; the performance of a classifier is measured by its
average error over 100 partitions of the data sets into training and testing sets. We use the linear kernel with
$\lambda = 1$; the results are stable against variations of $\lambda$. Our approach and TLD are applied to the same
partitions of the data sets, so that the comparison is meaningful. The results are shown in Table II: our approach
outperforms TLD on four of the five data sets and matches it on Thyroid.

It is worth stressing that our results are obtained without fine-tuning of parameters. In particular, note that our
definition of optimal target vector fixes the relative importance of the two terms in equation (6).

IV. CONCLUSIONS 

We have presented a new approach to semi-supervised learning based on the notion of optimal target vector, the
target vector such that KRR provides the minimum training error over all the possible target vectors. The proposed
algorithm is characterized by the fact that the first kernel principal component is recovered as the number of
labeled points vanishes; hence it may be seen as a semi-supervised generalization of kernel principal component
analysis. The effectiveness of the proposed approach for transductive inference has also been demonstrated.

TABLE II: The percentage test error of transductive linear discrimination (TLD) and the optimal target (OT) approach
on five data sets from the UCI database.

                 TLD     OT
Diabetes         23.3    11.98
Titanic          22.4    6.52
Breast Cancer    25.7    16.7
Heart            15.7    3.3
Thyroid          4.0     4.0

Acknowledgements. The authors thank Olivier Chapelle for valuable correspondence on the subject of this paper.
Discussions on semi-supervised learning with Eytan Domany and Noam Shental are warmly acknowledged.

FIG. 1: The solutions of equation (7) are depicted, for a typical instance of four labeled points in the IRIS data
set. The star corresponds to the solution with $\mu < \mu_1$, which has the smallest energy $\mathcal{E}$.

FIG. 2: For the IRIS data set and $m = 4$, we depict the distribution (over 10000 random selections of labeled points)
of the linear correlation $R$ between the eigenvectors of $\tilde{H}$ and the true labels. From left to right and top
to bottom, we refer to the first, the second, the third and the fourth eigenvector. Grey (black) histogram bars denote
values of $R$ lower (greater) than those of the corresponding fully unsupervised principal component.

FIG. 3: (Top) The IRIS data set is depicted in the plane of the first two principal components, * versicolor,
+ virginica. The linear correlation of the first principal component with the true labels is $R = 0.732$. The four
selected points are surrounded by circles. (Bottom) The data set is represented in the plane of the first two
eigenvectors of $\tilde{H}$. The linear correlation between $\mathbf{y}^*$ and the true labels is $R = 0.615$. (Note
that two circles are almost overlapping and thus difficult to distinguish.)

FIG. 4: (Top) The IRIS data set is depicted in the plane of the first two principal components, * versicolor,
+ virginica. The four selected points are surrounded by circles. (Bottom) The data set is represented in the plane of
the first two eigenvectors of $\tilde{H}$. The linear correlation between $\mathbf{y}^*$ and the true labels is, in
this case, $R = 0.846$.

FIG. 5: The fraction $f$ (see the text) is depicted as a function of $m/N$ for the three data sets considered here.
10000 random selections of the labeled points are considered for each value of $m$ and for each data set.



